Dataset Description

The dataset, provided by the Dallas Police Department, records use-of-force incidents involving officers and subjects. For each incident it reports the race and gender of the officer and the subject, the types of force used by the officer, and any injuries sustained during the incident. The aim is to analyse the research question of whether race has an effect on arrests and other crime-related incidents involving both parties. We address this question in the steps below.

First, we load the CSV file, clean the column names, and remove columns that contain no values.

library(dplyr)    # provides %>%
library(janitor)  # provides clean_names() and remove_empty()

df <- read.csv("~/Documents/R_data_Visualizations/37-00049_UOF-P_2016_prepped.csv",
               na.strings = c("")) %>%
  clean_names() %>%
  remove_empty()
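As a rough illustration of what this loading step does, the sketch below reproduces it in base R on a tiny inline CSV. The inline data, the names `csv_text`, `raw` and `cleaned`, and the regex-based renaming are assumptions made for illustration; the renaming only approximates `clean_names()` and is not janitor's implementation.

```r
# Tiny inline CSV standing in for the real file (illustrative data only)
csv_text <- "Incident Date,Officer ID,Empty Col
9/3/16,10810,
3/22/16,7706,
"

# Treat empty strings as NA, as in the read.csv() call above
raw <- read.csv(text = csv_text, na.strings = c(""))

# Approximate clean_names(): lower-case, dots replaced by underscores
names(raw) <- tolower(gsub("\\.", "_", names(raw)))

# Approximate remove_empty() for columns: drop columns that are entirely NA
cleaned <- raw[, colSums(!is.na(raw)) > 0, drop = FALSE]

print(names(cleaned))
```

The column that holds only empty strings is read as all NA and is dropped, which is the effect the loading step relies on.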

Overview of data

The head function gives an overview of the dataset by printing its first rows. It shows what each variable contains, which helps to decide which variables to keep. The information from this initial step feeds into the exploratory data analysis (EDA).

head(df)
##   incident_date incident_time    uof_number officer_id officer_gender
## 1    OCCURRED_D    OCCURRED_T        UOFNum CURRENT_BA         OffSex
## 2        9/3/16    4:14:00 AM         37702      10810           Male
## 3       3/22/16   11:00:00 PM         33413       7706           Male
## 4       5/22/16    1:29:00 PM         34567      11014           Male
## 5       1/10/16    8:55:00 PM         31460       6692           Male
## 6       11/8/16    2:30:00 AM  37879, 37898       9844           Male
##   officer_race officer_hire_date officer_years_on_force officer_injury
## 1      OffRace           HIRE_DT    INCIDENT_DATE_LESS_     OFF_INJURE
## 2        Black            5/7/14                      2             No
## 3        White            1/8/99                     17            Yes
## 4        Black           5/20/15                      1             No
## 5        Black           7/29/91                     24             No
## 6        White           10/4/09                      7             No
##            officer_injury_type officer_hospitalization subject_id subject_race
## 1              OFF_INJURE_DESC              OFF_HOSPIT     CitNum      CitRace
## 2 No injuries noted or visible                      No      46424        Black
## 3                Sprain/Strain                     Yes      44324     Hispanic
## 4 No injuries noted or visible                      No      45126     Hispanic
## 5 No injuries noted or visible                      No      43150     Hispanic
## 6 No injuries noted or visible                      No      47307        Black
##   subject_gender subject_injury          subject_injury_type
## 1         CitSex     CIT_INJURE             SUBJ_INJURE_DESC
## 2         Female            Yes      Non-Visible Injury/Pain
## 3           Male             No No injuries noted or visible
## 4           Male             No No injuries noted or visible
## 5           Male            Yes               Laceration/Cut
## 6           Male             No No injuries noted or visible
##   subject_was_arrested subject_description          subject_offense
## 1           CIT_ARREST          CIT_INFL_A               CitChargeT
## 2                  Yes   Mentally unstable                    APOWW
## 3                  Yes   Mentally unstable                    APOWW
## 4                  Yes             Unknown                    APOWW
## 5                  Yes FD-Unknown if Armed           Evading Arrest
## 6                  Yes             Unknown Other Misdemeanor Arrest
##   reporting_area beat sector      division location_district street_number
## 1             RA BEAT SECTOR      DIVISION         DIST_NAME      STREET_N
## 2           2062  134    130       CENTRAL               D14           211
## 3           1197  237    230     NORTHEAST                D9          7647
## 4           4153  432    430     SOUTHWEST                D6           716
## 5           4523  641    640 NORTH CENTRAL               D11          5600
## 6           2167  346    340     SOUTHEAST                D7          4600
##    street_name street_direction street_type
## 1       STREET         street_g    street_t
## 2        Ervay                N         St.
## 3     Ferguson             NULL         Rd.
## 4 bimebella dr             NULL         Ln.
## 5          LBJ             NULL       Frwy.
## 6    Malcolm X                S       Blvd.
##   location_full_street_address_or_intersection location_city location_state
## 1                               Street Address          City          State
## 2                               211 N ERVAY ST        Dallas             TX
## 3                             7647 FERGUSON RD        Dallas             TX
## 4                             716 BIMEBELLA LN        Dallas             TX
## 5                               5600 L B J FWY        Dallas             TX
## 6                        4600 S MALCOLM X BLVD        Dallas             TX
##   location_latitude location_longitude incident_reason reason_for_force
## 1          Latitude          Longitude      SERVICE_TY       UOF_REASON
## 2         32.782205         -96.797461          Arrest           Arrest
## 3         32.798978         -96.717493          Arrest           Arrest
## 4          32.73971          -96.92519          Arrest           Arrest
## 5              <NA>               <NA>          Arrest           Arrest
## 6              <NA>               <NA>          Arrest           Arrest
##     type_of_force_used1 type_of_force_used2 type_of_force_used3
## 1            ForceType1          ForceType2          ForceType3
## 2 Hand/Arm/Elbow Strike                <NA>                <NA>
## 3           Joint Locks                <NA>                <NA>
## 4     Take Down - Group                <NA>                <NA>
## 5        K-9 Deployment                <NA>                <NA>
## 6        Verbal Command     Take Down - Arm                <NA>
##   type_of_force_used4 type_of_force_used5 type_of_force_used6
## 1          ForceType4          ForceType5          ForceType6
## 2                <NA>                <NA>                <NA>
## 3                <NA>                <NA>                <NA>
## 4                <NA>                <NA>                <NA>
## 5                <NA>                <NA>                <NA>
## 6                <NA>                <NA>                <NA>
##   type_of_force_used7 type_of_force_used8 type_of_force_used9
## 1          ForceType7          ForceType8          ForceType9
## 2                <NA>                <NA>                <NA>
## 3                <NA>                <NA>                <NA>
## 4                <NA>                <NA>                <NA>
## 5                <NA>                <NA>                <NA>
## 6                <NA>                <NA>                <NA>
##   type_of_force_used10 number_ec_cycles force_effective
## 1          ForceType10       Cycles_Num      ForceEffec
## 2                 <NA>             NULL             Yes
## 3                 <NA>             NULL             Yes
## 4                 <NA>             NULL             Yes
## 5                 <NA>             NULL             Yes
## 6                 <NA>             NULL         No, Yes

Table 1. First rows of the dataset, from head(df).

EDA Analysis

EDA is an important step for this assignment. As the overview below shows, every column is read in as text, so we will convert the variables to factors and numerics. This helps with data visualization.

First, we get the shape of the dataset with the dim function.

dim(df)
## [1] 2384   47

The first row is an extra header row that repeats the original column labels, so we remove it.

We also use the attach function of base R, which lets us refer to the dataset variables without typing the $ prefix each time.
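A minimal sketch of how the extra label row could be dropped, shown on a toy data frame. The toy values are copied from the head(df) output above; dropping by position with `toy[-1, ]` is an assumption about the removal step, whose code is not shown in this report.

```r
# Toy data frame whose first row repeats the original column labels
toy <- data.frame(
  officer_id     = c("CURRENT_BA", "10810", "7706"),
  officer_gender = c("OffSex", "Male", "Male"),
  stringsAsFactors = FALSE
)

# Drop the first row by position and reset the row names
toy_clean <- toy[-1, ]
rownames(toy_clean) <- NULL

print(dim(toy_clean))
```

Removing one row is consistent with the row count going from 2384 in dim(df) to 2383 in the ExpData output.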

With the ExpData function we can take a closer look at the variable types and other important details of the dataset.

library("SmartEDA")
ExpData(data=df,type=1)
##                                           Descriptions      Value
## 1                                   Sample size (nrow)       2383
## 2                              No. of variables (ncol)         47
## 3                    No. of numeric/interger variables          0
## 4                              No. of factor variables          0
## 5                                No. of text variables         47
## 6                             No. of logical variables          0
## 7                          No. of identifier variables          0
## 8                                No. of date variables          0
## 9             No. of zero variance variables (uniform)          4
## 10               %. of variables having complete cases 76.6% (36)
## 11   %. of variables having >0% and <50% missing cases  4.26% (2)
## 12 %. of variables having >=50% and <90% missing cases  2.13% (1)
## 13          %. of variables having >=90% missing cases 17.02% (8)

The same overview is given below, with the data type of each column.

library(skimr)
library(DT)  # provides datatable()
datatable(skim(df))

Variable data types

Some placeholder values in the dataset (for example the string "NULL") will be filtered out as needed during visualization instead of being removed row by row here.

From the table above we conclude that almost all variables are of type character, which is not helpful for visual analysis such as boxplots. In the chunk below we therefore convert the character columns to factors and then convert the columns that hold numbers to numeric.

# Convert every column to a factor
df <- lapply(df, as.factor) %>% data.frame()

# Convert number-like columns via character so the labels, not the level codes, are parsed
df$officer_years_on_force <- as.numeric(as.character(df$officer_years_on_force))
df$street_number <- as.numeric(as.character(df$street_number))
df$sector <- as.numeric(as.character(df$sector))

After getting an overall view of the data, let’s check the measures of central tendency, which summarise the numeric variables statistically.
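The conversion above goes through as.character before as.numeric. The sketch below, on a made-up factor, shows why: calling as.numeric directly on a factor returns its internal level codes. The vector `years` and its values are illustrative only.

```r
# A factor stores integer level codes plus character labels
years <- factor(c("2", "17", "1", "24"))

# Pitfall: as.numeric() on a factor returns the level codes, not the values
codes <- as.numeric(years)

# Going through character first parses the labels themselves
values <- as.numeric(as.character(years))

print(codes)
print(values)
```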

diagnose_numeric(df)
## # A tibble: 3 × 10
##   variables              min    Q1   mean median    Q3   max  zero minus outlier
##   <chr>                <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <int> <int>   <int>
## 1 officer_years_on_fo…     0     3 8.05e0      6    10    36     3     0     240
## 2 sector                 110   210 3.89e2    350   610   750     0     0       0
## 3 street_number            0  1700 4.90e3   3415  7532 54023     1     0      58
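The outlier column can be read with the usual boxplot rule in mind. The sketch below applies base R's boxplot.stats to made-up values; that the counts in the table follow the same 1.5 × IQR rule is an assumption, not something this report confirms.

```r
# Illustrative years-of-service values (not taken from the dataset)
years <- c(0, 1, 2, 3, 3, 4, 5, 6, 6, 7, 8, 10, 36)

# boxplot.stats() flags points lying more than 1.5 * IQR beyond the hinges
stats <- boxplot.stats(years)
outliers <- stats$out

print(stats$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(outliers)
```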

Some columns that contain a large number of missing values (NA) are removed.

df <- df %>% select(-c("uof_number",matches("used")))

Outliers can also be detected with boxplots.

# %in% tests membership; == would recycle c("Male", "Female") across the rows
df %>%
  filter(subject_gender %in% c("Male", "Female")) %>%
  plot_boxplot(., by = "officer_gender")
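When a column is filtered against a set of allowed values, the choice between `==` and `%in%` matters. The sketch below uses a made-up vector to show that `==` recycles the right-hand side element by element, while `%in%` tests set membership. The vector `gender` is illustrative only.

```r
# Illustrative gender values (not taken from the dataset)
gender <- c("Male", "Female", "Male", "Male", "Female", "Unknown")

# == compares position by position against the recycled vector
# ("Male", "Female", "Male", "Female", ...), so valid rows are missed
by_equals <- gender == c("Male", "Female")

# %in% checks each value against the whole set
by_in <- gender %in% c("Male", "Female")

print(sum(by_equals))
print(sum(by_in))
```

Only `%in%` keeps every Male and Female row regardless of its position in the data.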

We observe that most outliers relate to male officers with many years of service. Moreover, the average length of service of officers of both genders is less than 10 years.

For factor variables such as `officer_injury_type` we can obtain a detailed description of each incident separately. We start with duplicate detection.

p <- df %>%
  get_dupes(officer_injury_type) %>%
  ggboxplot(x = "officer_injury_type", y = "dupe_count",
            color = "officer_gender", add = "jitter") +
  theme(axis.text.x = element_text(angle = 90, size = 5)) +
  scale_alpha(0.5) +
  ylim(0, 100)
ggplotly(p)

We observe that most duplicates are "No injuries noted or visible" for both officer genders. The same check for subjects shows a similar trend, although the duplicate counts for abrasion injuries are higher for subjects.

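p <- df %>%
  get_dupes(subject_injury_type) %>%
  # c(NULL, "Unknown") is just "Unknown"; exclude the "NULL" string explicitly
  filter(!is.na(subject_gender), !(subject_gender %in% c("NULL", "Unknown"))) %>%
  ggboxplot(x = "subject_injury_type", y = "dupe_count",
            color = "subject_gender", add = "jitter") +
  theme(axis.text.x = element_text(angle = 90, size = 5)) +
  ylim(0, 100)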

ggplotly(p)

Duplicates in the dataset for injury type

We can count the incidents at each time of day separately for each subject race by using the tabyl function.

# Drop the column for the "NULL" race label; a bare -NULL would not remove it
datatable(tabyl(df, incident_time, subject_race) %>% select(-any_of("NULL")))

We observe that most incidents occur late at night and in the morning around 9 AM.

Furthermore, we can compute summary statistics for the categorical variables. We can run several tests on them, such as the chi-square test. The p-value is a statistical check of whether there is a significant difference between variables at the commonly chosen 5% significance level.

The table generated by the code chunk below gives many insights into the dataset. None of the variables is predictive enough to give a major result about the dataset, as shown in the last column. Although the p-values are less than 0.05, the degree of association between the categorical variables in our dataset is very weak, as shown by the results of the chi-square test.

p <-  datatable(ExpCatStat(df,Target="subject_race",result = "Stat",clim=10,nlim=5,Pclass="Yes"))

p
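A small p-value together with a weak association can look contradictory, so here is a hedged sketch of how both can hold at once. It runs a chi-square test on a made-up 2x2 table and computes Cramér's V as an effect size. The counts are invented, and whether Cramér's V is the association measure reported in the table is an assumption.

```r
# Illustrative 2x2 contingency table (counts are made up, not from the dataset)
tab <- matrix(c(550, 450, 450, 550), nrow = 2,
              dimnames = list(group = c("A", "B"), outcome = c("Yes", "No")))

# Chi-square test of independence (no continuity correction)
test <- chisq.test(tab, correct = FALSE)

# Cramer's V: sqrt(chi^2 / (n * (min(rows, cols) - 1))), ranges from 0 to 1
n <- sum(tab)
cramers_v <- sqrt(unname(test$statistic) / (n * (min(dim(tab)) - 1)))

print(test$p.value)
print(cramers_v)
```

With a large sample, even a modest difference between groups gives a p-value far below 0.05 while the effect size stays small.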

A graphical representation of the categorical variables is given below.

ExpCatViz(df,target="officer_race",fname=NULL,clim=10,col=NULL,margin=2,Page = c(2,1),sample=2)
## $`0`

ExpCatViz(df,target="subject_injury",fname=NULL,clim=10,col=NULL,margin=2,Page = c(2,1),sample=2)
## $`0`

The two figures above show that the percentage of officers injured during incidents is high for the groups that make up a large share of the total officer count. The injury percentages for Hispanic officers are almost equal, at 19% and 20%. On the other hand, there is a 15% chance that Asian officers are not able to arrest the subject. Officers are also most likely to be hospitalized along with subjects in incidents involving unfavorable conditions.

We can also compute the correlation between variables, expressed as a correlation coefficient. The results show that none of the numeric variables are strongly correlated with each other.

correlate(df)
## # A tibble: 6 × 3
##   var1                   var2                   coef_corr
##   <fct>                  <fct>                      <dbl>
## 1 sector                 officer_years_on_force    0.0182
## 2 street_number          officer_years_on_force    0.0410
## 3 officer_years_on_force sector                    0.0182
## 4 street_number          sector                    0.183 
## 5 officer_years_on_force street_number             0.0410
## 6 sector                 street_number             0.183

The tabular correlation data above can also be shown in graphical form.

 df %>% 
  correlate() %>%
  plot()
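For readers without the correlate helper, a similar pairwise table can be sketched in base R with cor. The toy data frame and its column names `a`, `b` and `c` are illustrative assumptions; this is not how the function used above is implemented.

```r
# Toy numeric columns (illustrative values only)
toy <- data.frame(
  a = 1:10,
  b = c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9),
  c = c(5, 3, 8, 1, 9, 2, 7, 4, 10, 6)
)

# Pearson correlation matrix, ignoring missing values pairwise
m <- cor(toy, use = "pairwise.complete.obs")

# Reshape to one row per ordered pair and drop the self-correlations
pairs <- as.data.frame(as.table(m), stringsAsFactors = FALSE)
names(pairs) <- c("var1", "var2", "coef_corr")
pairs <- pairs[pairs$var1 != pairs$var2, ]

print(pairs)
```

Three variables give six ordered pairs, the same shape as the six-row table printed above.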

The skewness check of numeric variables is given below.

find_skewness(df)
## [1]  7 24
find_skewness(df, index = FALSE)
## [1] "officer_years_on_force" "street_number"
find_skewness(df, value = TRUE)
## officer_years_on_force                 sector          street_number 
##                  1.484                  0.268                  2.313

So officer_years_on_force, the main numeric variable, is skewed and needs further treatment to remove the skewness. The other two variables can be left out of skewness removal since they do not carry much weight in the analysis.

The following graph shows that officers with more years of service are used for crowd control in the department. Furthermore, there is a very high chance of use of force when the subject has weapons. The boxplots with green fill suggest that senior officers use more severe levels of force when the subject is Black.
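One common way to reduce right skew is a log transform. The sketch below computes a moment-based skewness on made-up years-of-service values before and after log1p. The helper `skewness`, the sample values, and the choice of transform are all assumptions for illustration, not the treatment applied in this report.

```r
# Moment-based sample skewness: third central moment over variance^1.5
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Illustrative right-skewed years-of-service values (not from the dataset)
years <- c(0, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 8, 10, 15, 24, 36)

skew_raw <- skewness(years)
skew_log <- skewness(log1p(years))  # log1p handles the zeros safely

print(c(raw = skew_raw, log1p = skew_log))
```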

p <- df %>%
 filter(!(subject_race %in% "NULL")) %>%
 filter(!(reason_for_force %in% "NULL")) %>%
 ggplot() +
 aes(x = officer_years_on_force, y = reason_for_force, fill = subject_race) +
 geom_boxplot() +
 scale_fill_hue(direction = 1) +
 ggthemes::theme_base()+theme(legend.position = "bottom")

ggplotly(p)

The chart below shows that male subjects undergo use of force during arrest more often than female subjects. A similar trend is observed for almost all reasons for use of force, with a larger proportion of male subjects.

df1 <- df %>% select(subject_gender,reason_for_force,officer_gender)

p <- df1 %>%
 filter(subject_gender %in% c("Female", "Male")) %>%
 filter(!(reason_for_force %in% "NULL")) %>%
 ggplot() +
  aes(x = reason_for_force, fill = subject_gender) +
  geom_bar(position = "dodge") +
  scale_fill_hue(direction = 1) +
  coord_flip() +
  theme_minimal()

ggplotly(p)
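The bar chart shows counts; the proportions behind a statement such as "a larger proportion of male subjects" can be sketched with table and prop.table. The toy records below, including the "Other" reason label, are made up for illustration.

```r
# Toy incident records (illustrative values only)
toy <- data.frame(
  reason_for_force = c("Arrest", "Arrest", "Arrest", "Arrest",
                       "Other", "Other", "Other"),
  subject_gender   = c("Male", "Male", "Male", "Female",
                       "Male", "Male", "Female")
)

# Counts per reason and gender
counts <- table(toy$reason_for_force, toy$subject_gender)

# Row-wise proportions: share of each gender within each reason
shares <- prop.table(counts, margin = 1)

print(round(shares, 2))
```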

Summary

A data analysis was conducted on the Dallas (USA) policing equity dataset to examine the effect of race. The analysis shows that the major portion of officers of both genders come through incidents uninjured, while this is not the case for subjects. Black subjects are involved in the highest percentage of incidents, followed by Hispanic subjects. Statistical analysis of the categorical variables shows that there is no variable which can serve as a predictive variable. Similarly, the data for male officers, who make up more than 70% of officers in service, is skewed with respect to years of service. The normality of the dataset was checked, which gave a p-value < 0.05 at the 95% confidence level. The correlation matrix shows that none of the numeric variables are strongly correlated with each other. The statistical tables show that Asian officers have a lower chance of completing arrests in incidents than white police officers.